Matrix processing instruction with optional up/down sampling of matrix

ABSTRACT

A processor system comprises a shared memory and a processing element. The processing element includes a matrix processor unit and is in communication with the shared memory. The processing element is configured to receive a processor instruction specifying a data matrix and a matrix manipulation operation. A manipulation matrix based on the processor instruction is identified. The data matrix and the manipulation matrix are used to perform a matrix operation to determine a result matrix.

CROSS REFERENCE TO OTHER APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/708,224, entitled MATRIX PROCESSING INSTRUCTION WITH OPTIONAL UP/DOWNSAMPLING OF MATRIX, filed Dec. 9, 2019, which is incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

A whole class of complex artificial intelligence problems can be solved using neural networks. The implementation of neural network solutions is often dependent on how the input source or intermediate data is formatted and the requirements of neural network operations. Neural network operations may expect the data in a particular format. It is common to convert data from one matrix format to another to improve the accuracy and reduce the computational cost of implementing neural network operations. Traditionally, the conversion is challenging to adapt to hardware solutions and is performed in software. It is a challenge to create a hardware solution that is both flexible and offers significant performance improvement and efficiency. Therefore, a flexible and efficient hardware solution for performing matrix manipulation operations, including conversion operations for up-sampling and down-sampling matrices, is needed.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a block diagram illustrating an embodiment of a system for performing matrix manipulation operations.

FIG. 2 is a block diagram illustrating an embodiment of a processing element for performing matrix manipulation operations.

FIG. 3 is a flow chart illustrating an embodiment of a process for performing a matrix manipulation operation using a processor instruction to a processing element with a matrix processor unit.

FIG. 4 is a flow chart illustrating an embodiment of a process for performing a matrix manipulation operation using a processor instruction to a processing element with a matrix processor unit.

FIG. 5 is a flow chart illustrating an embodiment of a process for performing a matrix manipulation operation using a processor instruction to a processing element with a matrix processor unit.

FIG. 6 is a flow chart illustrating an embodiment of a process for performing a matrix manipulation operation using a processor instruction to a processing element with a matrix processor unit.

FIG. 7 is a diagram illustrating an example manipulation matrix and corresponding vector operands for performing a matrix manipulation operation.

FIG. 8 is a diagram illustrating an example input data matrix and corresponding matrix slice for performing a matrix manipulation operation.

FIG. 9 is a diagram illustrating an example result matrix from performing a matrix manipulation operation.

FIG. 10 is a diagram illustrating an example manipulation matrix and corresponding vector operands for performing a matrix manipulation operation.

FIG. 11 is a diagram illustrating an example manipulation matrix and corresponding vector operands for performing a matrix manipulation operation.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

A processor system for performing hardware accelerated matrix manipulation operations using processor instructions is disclosed. The matrix manipulation operations supported by the processor system include at least different up-sampling and down-sampling operations. In the disclosed processor system, the processor system includes a hardware matrix processor unit that performs matrix operations such as a matrix multiplication. The processor system supports one or more matrix manipulation operation processor instructions. For example, a processor instruction directs the processor system to up-sample an input data matrix by duplicating each element along a row. As another example, the elements down a column of the input data matrix can also be duplicated, effectively doubling a two-dimensional matrix along both dimensions. In some embodiments, the processor system supports up-sampling a matrix by linear interpolation. Each up-sampling, down-sampling, or another matrix manipulation operation can be performed and described by a designated manipulation matrix. In various embodiments, the designated manipulation matrices corresponding to the different supported matrix manipulation operations are stored in memory. For example, each manipulation matrix of a supported matrix manipulation operation can be stored in memory as a pre-defined matrix. In some embodiments, a manipulation matrix can be dynamically programmed and stored in memory. In response to a specified matrix manipulation processor instruction, the appropriate manipulation matrix is loaded from memory into the matrix processor unit of the processor system. A corresponding input data matrix is multiplied by the manipulation matrix using the matrix processor unit. The output result can be written to memory and/or used by the processor system for subsequent operations, such as matrix operations required for neural network inference or training. In some embodiments, the output is written to memory using strided writes and/or a memory layout unit to up-sample the result matrix of the matrix processor unit in a second dimension. For example, an up-sampling manipulation matrix up-samples the input data matrix in a first dimension, such as doubling the length of each row. Each up-sampled row is then written to memory twice using a memory layout unit to up-sample the input data matrix along the height dimension, effectively doubling the length of each column. The duplicative writes effectively double the size of the final matrix by duplicating the number of rows. In some embodiments, the result matrix of the matrix processor unit is written to memory in two passes to even and then odd (or vice versa) memory row addresses using a memory layout or scatter unit. The memory layout unit can be used to quickly and efficiently output an up-sampled input data matrix to memory.
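The two-stage up-sampling described above can be modeled in software as a rough sketch. The function names (duplication_matrix, matmul, upsample_2x) and the list standing in for memory rows are illustrative assumptions, not part of the disclosed hardware; the sketch only shows how a duplication-style manipulation matrix doubles the row length and how two write passes to even and odd row addresses double the number of rows.

```python
def duplication_matrix(n):
    """Assumed n x 2n manipulation matrix: columns 2i and 2i+1 both select input element i."""
    return [[1 if j // 2 == i else 0 for j in range(2 * n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of a (r x k) and b (k x c), one dot product per output element."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def upsample_2x(data):
    """Double the width by matrix multiplication, then the height by writing each row twice."""
    wide = matmul(data, duplication_matrix(len(data[0])))
    memory = [None] * (2 * len(wide))         # stand-in for destination memory rows
    memory[0::2] = [list(r) for r in wide]    # pass 1: even row addresses
    memory[1::2] = [list(r) for r in wide]    # pass 2: odd row addresses
    return memory
```

For a 2×2 input such as [[1, 2], [3, 4]], this model yields a 4×4 result in which every element appears in a 2×2 block.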

In some embodiments, a processor system comprises a shared memory and a processing element in communication with the shared memory. For example, a processing element may be a single processing element or one of a matrix of processing elements that each has access to the shared memory via a communication bus. A matrix of processing elements may be a 32×32, 64×64, or another sized matrix of processing elements. The processing element of the processor system includes a matrix processor unit. For example, a matrix processor unit is configured to perform at least a matrix multiplication on two matrix operands to determine a result matrix. In some embodiments, the matrix processor unit includes a first type of register configured to store all values of a single row of a data matrix and a group of a second type of registers, wherein each of the second type of registers is configured to store all values of a different column of a manipulation matrix. The matrix processor unit also includes a plurality of vector calculation units, wherein each of the plurality of vector calculation units corresponds to one of the second type of registers. Each vector calculation unit is configured to multiply each value stored in the first type of register with a corresponding value stored in the corresponding one of the second type of registers. The multiplication results of the corresponding vector calculation unit are summed to at least in part determine a corresponding element in a result matrix of multiplying the data matrix with the manipulation matrix.
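As an informal software model of this register arrangement, the sketch below holds one manipulation matrix column per vector calculation unit and presents each data matrix row to every unit. The class and function names (VectorCalculationUnit, multiply) are hypothetical and chosen only for illustration; real hardware would perform the per-unit work in parallel rather than in a loop.

```python
class VectorCalculationUnit:
    """Models one unit: holds a manipulation matrix column (second type of register)."""

    def __init__(self, column):
        self.column = list(column)

    def dot(self, row):
        # Multiply each row value by the corresponding column value, then sum.
        products = [a * b for a, b in zip(row, self.column)]
        return sum(products)


def multiply(data, manipulation):
    """Result matrix of data x manipulation, one output row per data row."""
    units = [VectorCalculationUnit(col) for col in zip(*manipulation)]
    result = []
    for row in data:                                   # row held in the first type of register
        result.append([unit.dot(row) for unit in units])
    return result
```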

In some embodiments, the processing element is configured to receive a processor instruction specifying a data matrix and a matrix manipulation operation. For example, a specialized processor instruction includes a data matrix operand and specifies a matrix manipulation operation such as an up-sample or down-sample operation. The processing element is further configured to identify a manipulation matrix based on the processor instruction. For example, the processor instruction is decoded to identify a manipulation matrix corresponding to the matrix manipulation operation. In some embodiments, the manipulation matrix is a hardcoded matrix stored in memory or another memory location. The processing element is configured to load the data matrix and the manipulation matrix into the matrix processor unit and perform a matrix operation to determine a result matrix. For example, in some embodiments, each column of the manipulation matrix is loaded into a vector computational unit of the matrix processor unit. For each row of the data matrix, the row is loaded or broadcasted to every vector computational unit with a corresponding column of the manipulation matrix. Each vector computational unit computes a dot-product result corresponding to an element in the result matrix. The processing element is configured to output the result matrix to a destination location. For example, the result matrix may be outputted to memory or another location such as a matrix register. In some embodiments, the outputting performs an up-sampling of the result matrix along one dimension of the data matrix. For example, each row is written out twice to duplicate the number of rows. In some embodiments, the manipulation matrix is an up-sampling, down-sampling, or another type of manipulation matrix for performing a matrix manipulation operation.
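The decode-and-identify step can be pictured as a lookup from the operation named in the instruction to a stored manipulation matrix. The following is a minimal sketch under stated assumptions: the operation names UPSAMPLE_2X and DOWNSAMPLE_2X, the lookup table, and the choice of a down-sampling matrix that keeps every other element are all hypothetical, since the text does not specify an instruction encoding or a particular down-sampling scheme.

```python
def upsample_matrix(n):
    """Assumed n x 2n matrix that duplicates each row element."""
    return [[1 if j // 2 == i else 0 for j in range(2 * n)] for i in range(n)]


def downsample_matrix(n):
    """Assumed n x (n // 2) matrix that keeps the even-indexed row elements."""
    return [[1 if i == 2 * j else 0 for j in range(n // 2)] for i in range(n)]


# Hypothetical table standing in for pre-defined manipulation matrices stored in memory.
MANIPULATION_MATRICES = {
    "UPSAMPLE_2X": upsample_matrix,
    "DOWNSAMPLE_2X": downsample_matrix,
}


def execute(operation, data):
    """Identify the manipulation matrix for the operation and multiply the data matrix by it."""
    manipulation = MANIPULATION_MATRICES[operation](len(data[0]))
    columns = list(zip(*manipulation))     # one column per vector computational unit
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in data]
```

Because the operation only selects which matrix is loaded, adding another manipulation operation in this model means adding one table entry; the multiplication path is unchanged.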

FIG. 1 is a block diagram illustrating an embodiment of a system for performing matrix manipulation operations. In the example shown, system 100 includes processing element 101 and memory 161. Processing element 101 includes manipulation matrix input unit 103, data matrix input unit 105, matrix processor unit 107, and output unit 151. Matrix processor unit 107 is a dot product engine and can perform matrix multiplication on two input matrices, a data matrix and a manipulation matrix. In some embodiments, matrix processor unit 107 includes multiple vector units (not shown) used to compute a matrix multiplication. Matrix processor unit 107 receives a manipulation matrix (not shown) from manipulation matrix input unit 103 and an input data matrix (not shown) from data matrix input unit 105 to compute the result of multiplying the input data matrix by the manipulation matrix. The result is outputted to output unit 151, which can be used to write the matrix multiplication results to memory 161. For example, in some embodiments, a two-dimensional manipulation matrix is prepared by manipulation matrix input unit 103 and successive input data vectors of a two-dimensional input data matrix are prepared by data matrix input unit 105. The two-dimensional manipulation matrix and the two-dimensional input data matrix may be retrieved from memory 161 and may be referenced by a memory address. In some embodiments, the input data matrix is referenced by a memory address and the manipulation matrix is determined by the matrix manipulation operation specified by a processor instruction. The two matrices are multiplied and the output is received at output unit 151. In some embodiments, the result matrix is computed one row each cycle by loading one row of the input data matrix each cycle into matrix processor unit 107.

In some embodiments, a processor instruction directed to processing element 101 references an input data matrix and a specific manipulation matrix. For example, the manipulation matrix may be a manipulation matrix for performing an up-sampling or a down-sampling operation. Moreover, the manipulation matrix may be one of several different types of up-sampling or down-sampling matrices or a matrix corresponding to another matrix manipulation operation. For example, for up-sampling operations, an up-sampling manipulation matrix may up-sample by doubling every row element, by performing linear interpolation between elements, by quadrupling every row element, or by using another up-sampling scheme. In various embodiments, the manipulation matrices are hardcoded in memory 161 and/or stored in another memory location. In some embodiments, the manipulation matrices for each matrix manipulation operation may be dynamically configured and stored in memory 161 and/or another memory location. In response to a matrix manipulation operation processor instruction, processing element 101 loads the proper manipulation matrix from memory 161 into matrix processor unit 107 via manipulation matrix input unit 103 and the corresponding input data matrix from memory 161 into matrix processor unit 107 via data matrix input unit 105. In some embodiments, the dimensions of the input data matrix are larger than are supported by matrix processor unit 107 and the input data matrix is processed as two-dimensional slices of the input data matrix, where matrix processor unit 107 supports the dimensions of the two-dimensional slices. For example, a 32×32 matrix processor unit can receive 32×32 slices of a much larger input data matrix. In various embodiments, the final matrix resulting from the matrix manipulation operation can have different dimensions from the input data matrix. For example, an up-sampling matrix manipulation operation results in a larger final matrix and a down-sampling matrix manipulation operation results in a smaller final matrix.
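The slicing of an oversized input data matrix can be sketched as follows. The generator name iter_slices and its row-major traversal order are illustrative assumptions; the text states only that the matrix processor unit receives two-dimensional slices whose dimensions it supports, such as 32×32.

```python
def iter_slices(matrix, size=32):
    """Yield ((row_offset, col_offset), slice) pairs covering the matrix in size x size tiles.

    Edge tiles are smaller when the matrix dimensions are not multiples of size.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    for r in range(0, rows, size):
        for c in range(0, cols, size):
            yield (r, c), [row[c:c + size] for row in matrix[r:r + size]]
```

A 64×96 input, for example, is covered by six 32×32 slices in this model, each of which could be handed to a matrix processor unit independently.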

In various embodiments, the arrows of FIG. 1 represent the direction data moves through the components of system 100. For example, the arrows may correspond to multi-element wide communication/data buses and/or data lines. In some embodiments, output unit 151 includes a memory layout or scatter unit (not shown) for implementing strided writes of the result matrix to memory. For example, output unit 151 stores the result matrix from matrix processor unit 107 and writes out the matrix to memory 161. Each row is written to memory twice using a memory layout unit to up-sample the input data matrix along the height dimension, effectively doubling the length of each column. In some embodiments, additional components of system 100 and processing element 101 are not shown. For example, a control logic unit for processing and decoding processor instructions is not shown. As another example, in some embodiments, processing element 101 includes local storage memory (not shown) that is used to store one or more manipulation matrices used for implementing one or more matrix manipulation operations. In response to a processor instruction, manipulation matrix input unit 103 loads a corresponding manipulation matrix from local storage memory, bypassing the time required to load data from memory 161. In some embodiments, a manipulation matrix stored in local storage memory can be dynamically and programmatically configured.

In some embodiments, processing element 101 is one of a plurality of processing elements (not shown) connected to memory 161. Memory 161 is a shared memory that each of the plurality of processing elements can access. For example, the processing elements may be arranged as a matrix of processing elements such as a grid of 32×32 processing elements. The processing elements can be configured to operate in parallel with one another. For example, different processing elements can perform different matrix manipulation operations in parallel. In some embodiments, the different processing elements can perform portions of the same matrix manipulation operation in parallel but on different slices of an input data matrix. The final result matrix may be written out to memory 161 as a complete result matrix by different processing elements writing their respective partial result matrices to memory. The performance of matrix manipulation operations is significantly increased by spreading the processing across different processing elements, with each processing element performing a portion of the matrix manipulation operation on an assigned portion of the input data matrix.

FIG. 2 is a block diagram illustrating an embodiment of a processing element for performing matrix manipulation operations. In the example shown, processing element 200 includes matrix processor unit 201, manipulation matrix input unit 203, data matrix input unit 205, and output unit 251. Matrix processor unit 201 includes multiple vector units including at least vector units 211 and 221. Each vector unit includes at least a vector multiply unit and a vector adder unit. For example, vector unit 211 includes vector multiply unit 213 and vector adder unit 215. Similarly, vector unit 221 includes vector multiply unit 223 and vector adder unit 225. In various embodiments, matrix processor unit 201 includes at least the number of vector units to match the number of elements in an input data vector generated by data matrix input unit 205. In various embodiments, matrix processor unit 201 is configured to receive two input matrices, each a two-dimensional matrix, via manipulation matrix input unit 203 and data matrix input unit 205, respectively, and output a matrix result to output unit 251. In some embodiments, processing element 200 is processing element 101 of FIG. 1 and matrix processor unit 201, manipulation matrix input unit 203, data matrix input unit 205, and output unit 251 are matrix processor unit 107, manipulation matrix input unit 103, data matrix input unit 105, and output unit 151, respectively, of FIG. 1. In some embodiments, multiple processing elements, such as processing element 200, and multiple matrix processor units, such as matrix processor unit 201, may be utilized in parallel for increased performance. For example, one processing element and its matrix processor unit can be used to process one slice of a large input data matrix and another processing element and its matrix processor unit can be used to process a separate slice of the same input data matrix.

In some embodiments, manipulation matrix input unit 203 is used to load a manipulation matrix into matrix processor unit 201 as separate vector operands corresponding to different columns of the manipulation matrix. For example, data corresponding to at least a portion of a two-dimensional manipulation matrix can be read from memory and processed by manipulation matrix input unit 203 before being loaded into matrix processor unit 201. In various embodiments, each vector operand generated by manipulation matrix input unit 203 may be directed to any one of the vector units of matrix processor unit 201, such as vector multiply unit 213 or 223. Each vector unit can be loaded with a different corresponding column of the manipulation matrix. For example, in some embodiments, matrix processor unit 201 includes 32 vector units. Over 32 cycles, 32 vector operands can be loaded into matrix processor unit 201 via manipulation matrix input unit 203. For each cycle, one vector operand is generated by manipulation matrix input unit 203 and then loaded into one of the 32 vector units. After 32 cycles, all 32 vector units have received a vector operand, each corresponding to a column of a 32-column manipulation matrix. In some embodiments, multiple data input vectors can be generated and loaded each cycle. For example, four input vectors can be generated in parallel to load 32 vector units in 8 cycles.

In some embodiments, data matrix input unit 205 is used to load an input data matrix into matrix processor unit 201 as separate vector operands corresponding to different rows of the input data matrix. For example, data corresponding to at least a portion of a two-dimensional input data matrix can be read from memory and processed by data matrix input unit 205 before being loaded into matrix processor unit 201. Each input data vector operand generated by data matrix input unit 205 corresponds to a row of the input data matrix and can be directed to any one, subset, or all of the vector units of matrix processor unit 201, such as vector multiply unit 213 or 223. For example, the same input data vector operand can be broadcasted to multiple vector units of matrix processor unit 201 to compute an entire output row of the modified matrix result. By broadcasting the same vector operand corresponding to a row of the input data matrix to multiple vector units, multiple vector units compute a dot product of the same data matrix row with different manipulation matrix columns in parallel. Once the results of an entire row of the modified matrix are determined, a vector operand corresponding to the next row of the input data matrix can be broadcasted to the appropriate vector units to determine the next output row of the modified matrix. In some embodiments, each row of the manipulation matrix is instead broadcasted to vector units corresponding to the different columns of the input data matrix.

In some embodiments, some elements of the vector operands may be unused or zeroed out. For example, an up-sampling manipulation operation may correspond to a 16×32 manipulation matrix that utilizes 16-element vectors for each column and/or a 32×16 input data matrix that utilizes 16-element vectors for each row. Each of the 32 vector units of a 32×32 matrix processor unit is loaded with a pair of 16-element vectors corresponding to a column of the 16×32 manipulation matrix and a row of the 32×16 input data matrix. The 16-element vector operand may be a 32-element vector with 16 zero-value or padding elements. The vector operands are prepared by manipulation matrix input unit 203 and/or data matrix input unit 205. Similarly, in some embodiments, only a subset of the vector units of matrix processor unit 201 is utilized. For example, a down-sampling manipulation operation may correspond to a 32×16 manipulation matrix that utilizes 32-element vectors for each column but only requires 16 vector units to load the entire 32×16 manipulation matrix into a 32×32 matrix processor unit. The vector operands are prepared by manipulation matrix input unit 203 and/or data matrix input unit 205 and directed to the appropriate vector units.
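Zero padding leaves the dot product unchanged because each padded position contributes a product of zero. A small sketch makes this concrete; the helper names pad_operand and dot and the default width of 32 are assumptions chosen to mirror the 32-element example, not a description of how the input units are implemented.

```python
def pad_operand(vector, width=32):
    """Return the vector extended with zero-value elements up to the hardware vector width."""
    if len(vector) > width:
        raise ValueError("vector is longer than the supported operand width")
    return list(vector) + [0] * (width - len(vector))


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

For a 16-element row and a 16-element column, dot(pad_operand(row), pad_operand(column)) equals dot(row, column), so a 32-element vector unit can process the shorter operands without any change to its datapath.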

In some embodiments, input vector operands generated by manipulation matrix input unit 203 and data matrix input unit 205 are passed as vector arguments to a vector unit, such as one of vector units 211 and 221, of matrix processor unit 201. Each vector unit of matrix processor unit 201 may determine a dot product result using the input vectors corresponding to a row of an input data matrix and a column of a manipulation matrix. In some embodiments, matrix processor unit 201 includes 32 vector units. Each vector unit may take two 32-element vectors as arguments and each can produce a single element result. Taken across all utilized vector units, the results are an output vector result and correspond to an output row of the modified matrix. In various embodiments, the output of matrix processor unit 201 can be an output vector and is received at output unit 251. Over multiple cycles, the output received at output unit 251 is a matrix result. In some embodiments, the output vector received at output unit 251 is a 32-element vector. Other vector lengths may be utilized as appropriate. For example, a 16-element vector can be outputted by taking the outputs of only 16 of the 32 vector units. Similarly, the size of the elements processed by processing element 200 can be configured as appropriate. For example, elements may be 4 bits, 8 bits, 2 bytes, 4 bytes, or another appropriate size.

In some embodiments, the number of cycles required to load a vector operand from memory via manipulation matrix input unit 203 and/or data matrix input unit 205 into matrix processor unit 201 is based on the utilization of the matrix processor unit. For example, to keep matrix processor unit 201 near full utilization, data arguments for the vector units are retrieved from memory and prepared over a time period (e.g., a certain number of cycles) that closely matches the compute utilization of the vector units. By matching the load and compute times, matrix processor unit 201 can be kept near full utilization. In some embodiments, data read times are reduced, for example, by increasing the bus speed, to better match the load and compute times. For example, in various embodiments, matrix processor unit 201 may take approximately eight clock cycles to complete a certain set of computations. (An example of a set of computations might include applying eight different rows of an input data matrix to a set of input vectors corresponding to a manipulation matrix.) A read rate of one vector operand per cycle would require at least 32 cycles to load all vector units. Increasing the read rate by a factor of four allows all 32 vector operands to be loaded in approximately 8 cycles, matching the processing compute time of the matrix processor unit. In various embodiments, by matching the data read speed, for example, the data bus speed used to load vector operands, with matrix processor unit compute performance and workload, the overall efficiency and throughput of matrix processor unit 201 is significantly increased. In some embodiments, the read speed is at least in part increased using the techniques disclosed herein. For example, multiple vector operands corresponding to different columns of the manipulation matrix may be generated in parallel by manipulation matrix input unit 203 to multiply the overall effective read speed. In some embodiments, manipulation matrix input unit 203 may process multiple input vectors in parallel to reduce the number of cycles required to load a corresponding manipulation matrix into matrix processor unit 201.
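The load-time arithmetic in this passage can be checked with a one-line model. The function name load_cycles and the assumption that every cycle delivers a fixed number of vector operands are simplifications for illustration; actual load times would also depend on memory latency and bus behavior that the text does not quantify.

```python
import math


def load_cycles(num_vector_units, operands_per_cycle):
    """Cycles needed to give every vector unit one operand at a fixed per-cycle read rate."""
    return math.ceil(num_vector_units / operands_per_cycle)
```

With 32 vector units, load_cycles(32, 1) gives 32 cycles and load_cycles(32, 4) gives 8 cycles, which matches the approximately eight-cycle compute time used in the example above.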

In some embodiments, matrix processor unit 201 includes multiple vector units that each include a vector multiply unit and a vector adder unit. Each vector multiply unit, such as vector multiply unit 213 or 223, is configured to multiply corresponding elements received via manipulation matrix input unit 203 and data matrix input unit 205. In some embodiments, the result is a vector of multiplication results. For example, for two 32-byte input vectors, the result of a vector multiply unit is a vector of 32 multiplication results. The first element of an input data matrix row from data matrix input unit 205 is multiplied with the first element of a manipulation matrix column from manipulation matrix input unit 203. Similarly, the second element of an input data matrix row is multiplied with the second element of a manipulation matrix column. In various embodiments, the vector of multiplication results is passed to a vector adder unit of the vector unit. For example, vector multiply unit 213 passes its multiplication results to vector adder unit 215 and vector multiply unit 223 passes its multiplication results to vector adder unit 225.

In some embodiments, each vector adder unit, such as vector adder unit 215 or 225, is configured to compute the sum of the elements from an input vector. For example, the sum of the elements from a vector of multiplication results computed by vector multiply unit 213 is computed by vector adder unit 215. Similarly, the sum of the elements from a vector of multiplication results computed by vector multiply unit 223 is computed by vector adder unit 225. In some embodiments, the result of a vector adder unit is a dot product of the vectors used as input to the corresponding vector multiply unit. In various embodiments, each vector adder unit, such as vector adder unit 215 or 225, is implemented as an adder tree. For example, the top level of an adder tree may add pairs of elements to determine a set of partial sums, such as adding elements 0 and 1 to determine a first partial sum and elements 2 and 3 to determine a second partial sum, etc. Each subsequent level may sum pairs of partial sums from the previous level until the last level computes a final result sum. In various embodiments, each adder tree computes partial sums in parallel to arrive at a result sum. The parallel operation significantly improves the efficiency of summing a vector of numbers. In various embodiments, multiple vector units can operate in parallel to compute multiple dot products in parallel, significantly improving the throughput of matrix manipulation operations.
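The pairwise reduction performed by an adder tree can be modeled level by level. In this sketch, the function name adder_tree_sum, the zero padding of odd-length levels, and the returned level count are illustrative assumptions; in hardware, all additions within one level would occur in parallel.

```python
def adder_tree_sum(values):
    """Sum values by repeatedly adding adjacent pairs; return (total, number_of_levels)."""
    level = list(values)
    if not level:
        return 0, 0
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad an odd-length level so every element has a partner
        # One tree level: elements 0 and 1, 2 and 3, and so on are added independently.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels
```

For a 32-element vector of multiplication results the model needs five levels, compared with 31 dependent additions for a purely sequential sum.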

FIG. 3 is a flow chart illustrating an embodiment of a process for performing a matrix manipulation operation using a processor instruction to a processing element with a matrix processor unit. For example, a matrix manipulation operation is performed using a hardware matrix processor unit in response to receiving a processor instruction at a processing element. The instruction specifies the type of matrix manipulation operation to perform, such as a type of up-sampling, down-sampling, or another appropriate operation, and the source input data matrix to perform the operation on. In some embodiments, the resulting matrix has different dimensions than the input matrix. For example, an up-sampling matrix manipulation operation results in a longer length along the up-sampled dimension and a down-sampling matrix manipulation operation results in a shortened length along the down-sampled dimension. In some embodiments, the matrix manipulation operation is performed using matrix processor unit 107 of FIG. 1 and/or matrix processor unit 201 of FIG. 2.

At 301, a matrix manipulation operation processor instruction is received. For example, a processor instruction specifying a matrix manipulation operation, such as a type of up-sampling, down-sampling, or another appropriate matrix manipulation operation is received at a processing element. An up-sampling operation may correspond to doubling the width of the input matrix by repeating every element. Another up-sampling operation may correspond to nearly doubling the width of the input matrix by linearly interpolating every other element. Other matrix operations are appropriate as well. The processor instruction also specifies an input data matrix, such as a two-dimensional data matrix stored in memory. The input data matrix may be referenced by a memory location such as a memory address in memory, a register location, or another memory reference. In some embodiments, the memory location is a local memory of the processing element.
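One possible reading of the interpolating up-sample is sketched below: original elements are kept at even output positions and the midpoint of each adjacent pair is placed between them, so a row of n elements becomes a row of 2n - 1 elements. The midpoint weighting, the output width, and the names interpolation_matrix and apply_to_row are assumptions made for illustration; the text does not give the coefficients of the interpolation manipulation matrix.

```python
def interpolation_matrix(n):
    """Assumed n x (2n - 1) manipulation matrix for midpoint linear interpolation."""
    m = [[0.0] * (2 * n - 1) for _ in range(n)]
    for i in range(n):
        m[i][2 * i] = 1.0              # even output columns copy input element i
        if i + 1 < n:
            m[i][2 * i + 1] = 0.5      # odd output columns average elements i and i + 1
            m[i + 1][2 * i + 1] = 0.5
    return m


def apply_to_row(row, manipulation):
    """Multiply one data matrix row by the manipulation matrix, one dot product per column."""
    return [sum(a * b for a, b in zip(row, col)) for col in zip(*manipulation)]
```

Under these assumptions, the row [1.0, 3.0, 5.0] becomes [1.0, 2.0, 3.0, 4.0, 5.0], which is why the width is nearly, rather than exactly, doubled.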

At 303, a manipulation matrix operand is prepared. For example, a manipulation matrix corresponding to the matrix manipulation operation of the processor instruction received at 301 is identified and prepared for a matrix processor unit. The manipulation matrix may be retrieved from memory, local memory of the processing element, or another memory location, such as a matrix register. In some embodiments, the manipulation matrix operand is prepared by preparing operand vectors corresponding to each column of the manipulation matrix. Each vector operand of the manipulation matrix can be loaded into corresponding vector units of the matrix processor unit. In some embodiments, the matrix processor unit operates on vector sizes larger than the column length of the manipulation matrix and only a subset of the vector elements are used. For example, unused vector elements of a vector operand are replaced with zero-value or padding elements. As one illustration, a 16-element column vector is stored in a 32-element vector operand using 16 elements from the appropriate manipulation matrix column and another 16 zero-value elements. In some embodiments, the manipulation matrix operand is prepared by a manipulation matrix input unit such as manipulation matrix input unit 103 of FIG. 1 and/or manipulation matrix input unit 203 of FIG. 2.

At 305, an input data matrix operand is prepared. For example, an input data matrix is prepared for a matrix processor unit to perform the matrix manipulation operation of the processor instruction received at 301. The input data matrix may be retrieved from memory, local memory of the processing element, or another memory location, such as a matrix register. In some embodiments, the input data matrix operand is prepared by preparing operand vectors corresponding to each row of the input data matrix. Each vector operand can be broadcasted to vector units of the matrix processor unit that receive a corresponding column of the manipulation matrix at 303. In some embodiments, the matrix processor unit operates on vector sizes larger than the row length of the input data matrix and only a subset of the vector elements are used. For example, unused vector elements of a vector operand are replaced with zero-value or padding elements. In some embodiments, the input data matrix operand is prepared by a data matrix input unit such as data matrix input unit 105 of FIG. 1 and/or data matrix input unit 205 of FIG. 2.
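The operand preparation at 303 and 305 can be illustrated with a short sketch. The helper names `pad_operand`, `column_operands`, and `row_operands` are hypothetical, matrices are modeled as lists of rows, and zero-value padding is assumed (the description also permits other padding elements).

```python
def pad_operand(elements, width=32):
    """Extend a vector to the operand width using zero-value elements."""
    if len(elements) > width:
        raise ValueError("vector is longer than the operand width")
    return list(elements) + [0.0] * (width - len(elements))


def column_operands(manipulation, width=32):
    """One padded operand per column of the manipulation matrix (step 303)."""
    n_rows, n_cols = len(manipulation), len(manipulation[0])
    return [
        pad_operand([manipulation[r][c] for r in range(n_rows)], width)
        for c in range(n_cols)
    ]


def row_operands(data, width=32):
    """One padded operand per row of the input data matrix (step 305)."""
    return [pad_operand(row, width) for row in data]
```

Because the padded positions are zero in both operands, they contribute nothing to the dot product, so the padded and unpadded computations give the same result.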

In some embodiments, the input data matrix is larger in dimensions than supported by the matrix processor unit and/or manipulation matrix. For example, a matrix processor unit may operate on matrices up to 32×32 elements. In the event the input data matrix has dimensions larger than 32×32 and/or is a size incompatible with the manipulation matrix, the input data matrix is sliced into appropriate two-dimensional matrix slices compatible with the matrix processor unit and manipulation matrix. For example, an up-sampling manipulation matrix may utilize a 16×32 manipulation matrix. The input data matrix is sliced into 32×16 input data slices that are compatible with both a 32×32 matrix processor unit and the 16×32 manipulation matrix to output an up-sampled result matrix. In the event there are multiple input data slices, the matrix manipulation operation may be performed on each slice. In some embodiments, such as interpolation operations, the slices may overlap.
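A minimal sketch of the slicing described above, assuming non-overlapping slices and input dimensions that divide evenly by the slice dimensions (overlapping slices for interpolation are not modeled). The function name `slice_matrix` and its parameters are hypothetical.

```python
def slice_matrix(matrix, slice_rows=32, slice_cols=16):
    """Yield (row_offset, col_offset, tile) for each slice of a 2-D matrix."""
    total_rows, total_cols = len(matrix), len(matrix[0])
    if total_rows % slice_rows or total_cols % slice_cols:
        raise ValueError("this sketch assumes dimensions divide evenly")
    for r0 in range(0, total_rows, slice_rows):
        for c0 in range(0, total_cols, slice_cols):
            # Each tile has slice_rows rows and slice_cols columns.
            tile = [row[c0:c0 + slice_cols] for row in matrix[r0:r0 + slice_rows]]
            yield r0, c0, tile
```

With the default 32×16 slice size, a 64×32 input data matrix produces four slices, each compatible with a 16×32 manipulation matrix.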

At 307, the matrix manipulation operation is applied. For example, using the manipulation matrix operand and the input data matrix operand prepared at 303 and 305, respectively, a two-dimensional matrix manipulation operation is performed by a matrix processor unit. In some embodiments, the matrix manipulation operation is performed over a number of cycles, operating on one row of the input data matrix (or input data matrix slice) at a time to determine one row of an output matrix at a time. For example, the matrix processor unit may output a single vector result each cycle corresponding to one row of the result matrix. Each element of a row vector is determined by computing a dot product of one row of the input data matrix against a different column of the manipulation matrix. In various embodiments, the output of the matrix manipulation operation is a result matrix determined by multiplying the input data matrix by the manipulation matrix. In some embodiments, the output result is received by an output unit such as output unit 151 of FIG. 1 and/or output unit 251 of FIG. 2.
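The computation at 307 can be restated compactly. The symbols here are assumptions introduced for illustration and do not appear in the figures: X is the N×K input data matrix (or slice), M is the K×P manipulation matrix, and R is the N×P result matrix.

```latex
R_{i,j} = \sum_{k=1}^{K} X_{i,k}\, M_{k,j},
\qquad 1 \le i \le N,\quad 1 \le j \le P
```

For a 32×16 input data slice and a 16×32 manipulation matrix, N = 32, K = 16, and P = 32. In an embodiment that outputs one result row per cycle, row i of R is formed from the P dot products of row i of X with the P columns of M.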

At 309, the matrix manipulation operation result is outputted. For example, the resulting matrix is outputted to memory or another location. In some embodiments, the resulting matrix is written to memory such as memory 161 of FIG. 1. In some embodiments, the resulting matrix is written to a matrix register, which can be one or more registers for storing a matrix for future access by a processing element. The outputting functionality may perform additional matrix manipulation. For example, the outputting may be performed using multiple passes and strided writes to up-sample the matrix along a height dimension. In some embodiments, the strided writes are performed using a memory layout or scatter unit. Using the same output matrix of the matrix processor unit, a final matrix is outputted by writing every row of the output matrix to every other row of the final matrix over two passes. To output a 32×32 matrix using a 16×32 output matrix, the first pass fills in the odd rows (e.g., rows 1, 3, 5, . . . , and 31) and the second pass fills in the even rows (e.g., rows 2, 4, 6, . . . , and 32).
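The two-pass strided write can be modeled as follows. This is a software sketch only: the function name `strided_upsample_rows` is hypothetical, and a Python list stands in for the destination memory. Indices in the code are zero-based, so pass 0 writes what the text calls rows 1, 3, 5, and so on.

```python
def strided_upsample_rows(output_matrix, passes=2):
    """Write each output row once per pass, with a row stride equal to `passes`."""
    n_rows = len(output_matrix)
    final = [None] * (n_rows * passes)
    for p in range(passes):
        # Pass p starts at row offset p and advances by `passes` rows per write.
        for r, row in enumerate(output_matrix):
            final[r * passes + p] = list(row)
    return final
```

Applied to a 16×32 output matrix with two passes, the sketch yields a 32×32 final matrix in which every row of the output matrix appears twice in succession.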

In some embodiments, the output unit may enforce the output dimensions of the result matrix. For example, a down-sampled result matrix is smaller in at least one dimension than the input data matrix, such as when down-sampling a 32-column input matrix to 16 columns. In some embodiments, each of 32 vector units of a 32×32 matrix processor unit is capable of outputting a single dot product result each cycle. Instead of utilizing the output of all 32 vector units, the output of the matrix processor unit is trimmed to the 16 elements corresponding to the 16 columns of the down-sampled row. In some embodiments, the output dimensions are in part determined by an output unit such as output unit 151 of FIG. 1 and/or output unit 251 of FIG. 2. For example, an output unit selects the output from only vector units that have applicable results for the result matrix.

FIG. 4 is a flow chart illustrating an embodiment of a process for performing a matrix manipulation operation using a processor instruction to a processing element with a matrix processor unit. For example, a processor instruction is received at a processing element and decoded to prepare a manipulation matrix for performing a matrix manipulation operation. The matrix processor unit applies the manipulation matrix to input data specified by the processor instruction to perform the matrix manipulation operation. In some embodiments, the process of FIG. 4 is performed at 301 and/or 303 of FIG. 3. In some embodiments, the process of FIG. 4 is performed by system 100 of FIG. 1 using a processing element such as processing element 101 of FIG. 1 and/or processing element 200 of FIG. 2.

At 401, a matrix manipulation operation processor instruction is decoded. For example, a processor instruction specifying a specific matrix manipulation operation, such as a type of up-sampling, down-sampling, or another appropriate matrix manipulation operation, is decoded at a processing element. In some embodiments, the decoding is performed by a control logic unit of the processing element. The processor instruction directs the processing element to perform a matrix manipulation operation on an input data matrix. In some embodiments, the decoding includes determining the specific matrix manipulation operation and associated manipulation matrix, determining the input data matrix and its dimensions, and determining the output result, its dimensions, and the destination to output the result to. In some embodiments, the decoding also determines the output functionality associated with outputting the results. For example, the output functionality may include parameters for strided writes to further up-sample the output.

At 403, the manipulation matrix is identified. For example, every matrix manipulation operation is associated with a manipulation matrix and, at 403, the manipulation matrix for the decoded processor instruction is identified. In some embodiments, the identification includes determining the memory location, such as a memory address location or matrix register, of the manipulation matrix. In some embodiments, the identification includes determining the dimensions of the manipulation matrix.

At 405, the manipulation matrix is retrieved from memory. For example, the manipulation matrix is retrieved from memory via a manipulation matrix input unit. In some embodiments, the manipulation matrix input unit is manipulation matrix input unit 103 of FIG. 1 and/or manipulation matrix input unit 203 of FIG. 2. In some embodiments, the manipulation matrix is retrieved from local memory of the processing element, a matrix register, or another appropriate memory location identified at 403. In some embodiments, the manipulation matrix is retrieved one column at a time over multiple cycles.

At 407, the manipulation matrix is loaded into the matrix processor unit. For example, the manipulation matrix is loaded into a matrix processor unit via a manipulation matrix input unit. In some embodiments, the manipulation matrix input unit loads the manipulation matrix into the matrix processor unit one column vector at a time. For example, each column of the manipulation matrix is processed into a vector operand and loaded into a corresponding vector unit of the matrix processor unit. In some embodiments, multiple cycles are needed to load an entire manipulation matrix into the matrix processor unit. Once loaded into the matrix processor unit, the manipulation matrix can be reused and applied to different rows of the input data matrix. In some embodiments, the dimensions of the manipulation matrix are smaller than the largest matrix supported by the matrix processor unit and only a subset of the vector units of the matrix processor unit are utilized. For example, a 32×16 down-sampling manipulation matrix only requires 16 vector units, one for each of the 16 columns of the manipulation matrix. Each of the 16 vector units receives a 32-element vector corresponding to one of the 16 columns.

FIG. 5 is a flow chart illustrating an embodiment of a process for performing a matrix manipulation operation using a processor instruction to a processing element with a matrix processor unit. For example, a matrix manipulation operation is performed on an input data matrix specified by a processor instruction. The input data matrix may be too large to load into the matrix processor unit, or its dimensions may be incompatible with the corresponding manipulation matrix. In that case, the input data matrix is sliced into smaller matrices compatible with the operand size of a matrix processor unit and the manipulation matrix. Each slice is processed by the matrix processor unit by applying a corresponding manipulation matrix loaded into the matrix processor unit to each input data matrix slice. In some embodiments, the compatible-sized slices can be passed as operands to one or more different matrix processor units and the results combined. In some embodiments, the slices are operated on by different matrix processor units, for example, matrix processor units corresponding to different processing elements. In various embodiments, the process of FIG. 5 may be performed in response to a matrix manipulation operation instruction received at 301 of FIG. 3. In some embodiments, the process of FIG. 5 is performed at 305, 307, and/or 309 of FIG. 3. In some embodiments, the process of FIG. 5 is performed by system 100 of FIG. 1 using a processing element such as processing element 101 of FIG. 1 and/or processing element 200 of FIG. 2 and a matrix processor unit such as matrix processor unit 107 of FIG. 1 and/or matrix processor unit 201 of FIG. 2.

At 501, the next input data matrix slice of the input data matrix is identified. For example, an input data matrix is sliced into one or more input data matrix slices with sizes compatible with the operand size of the matrix processor unit and the manipulation matrix. In some embodiments, the slices overlap. The slices may be identified in memory and a read request may be issued to load the identified data. In some embodiments, the input data matrix is extremely large compared to the operand size of the matrix processor unit. The input data matrix is sliced into smaller compatible sizes for processing. At 501, the next slice is identified for processing.

At 503, the manipulation matrix is applied to the input data matrix slice. For example, an input data matrix slice is multiplied by the manipulation matrix using a matrix processor unit. The resulting matrix may be received at an output unit of the processing element. In some embodiments, the matrix manipulation operation is performed on the input data matrix slice over a number of cycles, operating on one row of the input data matrix slice at a time to determine one row of an output matrix at a time. For example, the matrix processor unit may output a single vector result each cycle corresponding to one row of the result matrix. Each element of a row vector is determined by computing a dot product of one row of the input data matrix slice against a different column of the manipulation matrix.

At 505, manipulation matrix results are outputted. For example, each vector unit of the matrix processor unit determines an element of an output vector. The output vector may correspond to a complete row of a result matrix and is received at an output unit such as output unit 151 of FIG. 1 and/or output unit 251 of FIG. 2. In various embodiments, the output unit gathers vector unit results over multiple iterations corresponding to multiple rows of the result matrix. The output unit writes the result matrix to memory, such as memory 161 of FIG. 1, or another appropriate memory location. In some embodiments, the result matrix is a slice of a larger result matrix, where the larger result matrix is the result of applying the matrix manipulation operation to the original input data matrix.

In some embodiments, the outputting functionality may perform additional matrix manipulation as described with respect to step 309 of FIG. 3. For example, the outputting may be performed using multiple passes and strided writes to up-sample the matrix along a height dimension. The output unit may also enforce the output dimensions of the result matrix. The output dimensions may be determined in part by an output unit such as output unit 151 of FIG. 1 and/or output unit 251 of FIG. 2. For example, an output unit selects the output from only vector units that have applicable results for the result matrix.

At 507, a determination is made whether additional data matrix slices require processing. In the event an additional data matrix slice remains to be processed, processing loops back to 501 to process the next slice. In the event no additional data matrix slice remains to be processed, processing ends.
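The loop of steps 501 through 507 can be sketched as follows. Several simplifications are assumptions of this sketch rather than features of the described hardware: only the column dimension is sliced, slices do not overlap, all slices run on a single modeled matrix processor unit, and each slice result is appended directly to the larger result matrix. The function name `apply_by_slices` is hypothetical.

```python
def apply_by_slices(data, manipulation, slice_cols=16):
    """Apply a manipulation matrix to each column slice and stitch the results."""
    out_cols = len(manipulation[0])
    if len(manipulation) != slice_cols:
        raise ValueError("manipulation matrix row count must equal slice width")
    if len(data[0]) % slice_cols:
        raise ValueError("this sketch assumes the width divides evenly")
    result = [[] for _ in data]
    # 501: identify the next slice by its starting column.
    for c0 in range(0, len(data[0]), slice_cols):
        for r, row in enumerate(data):
            piece = row[c0:c0 + slice_cols]
            # 503: one dot product per manipulation matrix column.
            out_row = [
                sum(piece[k] * manipulation[k][j] for k in range(slice_cols))
                for j in range(out_cols)
            ]
            # 505: the slice result becomes part of the larger result matrix.
            result[r].extend(out_row)
    # 507: the loop ends when no slices remain.
    return result
```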

FIG. 6 is a flow chart illustrating an embodiment of a process for performing a matrix manipulation operation using a processor instruction to a processing element with a matrix processor unit. For example, a matrix manipulation operation specified by a processor instruction is performed on an input data matrix compatible with the operand size of a matrix processor unit and the manipulation matrix. In some embodiments, the input data matrix is an input data matrix slice. Each input data matrix is processed by the matrix processor unit by applying a corresponding manipulation matrix loaded into the matrix processor unit to each input data matrix. In some embodiments, each row of the input data matrix is processed as an input data vector operand and broadcasted to multiple vector units of the matrix processor unit. Each of the multiple vector units of the matrix processor unit also receives a corresponding column of the manipulation matrix as a second vector operand. The dot product results calculated by the vector units together form a row of the result matrix. In various embodiments, the process of FIG. 6 may be performed in response to a matrix manipulation operation instruction received at 301 of FIG. 3 and/or may be performed on an input data matrix slice identified at 501 of FIG. 5. In some embodiments, the process of FIG. 6 is performed at 305 and/or 307 of FIG. 3 and/or at 503 of FIG. 5. In some embodiments, the process of FIG. 6 is performed by system 100 of FIG. 1 using a processing element such as processing element 101 of FIG. 1 and/or processing element 200 of FIG. 2 and a matrix processor unit such as matrix processor unit 107 of FIG. 1 and/or matrix processor unit 201 of FIG. 2.

At 601, the next input data vector from the input data matrix slice is identified. For example, an input data vector corresponding to a row of the input data matrix slice is identified and prepared for a matrix processor unit. In some embodiments, the data is read from memory. In various embodiments, the input data vector is a vector operand for the matrix processor unit prepared by a data matrix input unit such as data matrix input unit 105 of FIG. 1 and/or data matrix input unit 205 of FIG. 2. During each pass through step 601, the next input data vector corresponding to a row of the input data matrix slice is identified. Subsequent passes identify and process a different row until all rows and the entire input data matrix slice have been processed. In some embodiments, the input data vector only utilizes a subset of the vector length supported by the matrix processor unit. For example, the input data vector may have 16 elements even though the matrix processor unit can receive 32-element vector operands. In some embodiments, the unused vector elements are filled with padding elements and/or zero-value elements.

At 603, the input data vector is broadcasted to applicable vector units. For example, the input data vector identified at 601 is prepared as a vector operand and broadcasted to selected vector units of the matrix processor unit. The selected vector units each receive two vector operands, a vector operand corresponding to the input data vector and a vector operand corresponding to a column of the manipulation matrix. At 603, the applicable vector units each receive the vector operand corresponding to the input data vector. Depending on the matrix manipulation operation, a subset or all vector units of the matrix processor unit are utilized. For example, a 32×32 matrix processor unit may utilize all 32 vector units in the case where the manipulation matrix has 32 columns. In the case where the manipulation matrix has 16 columns, only 16 vector units are utilized and the input data vector can be broadcasted only to the applicable 16 vector units. In various embodiments, the vector operands corresponding to each column of the manipulation matrix can be reused across multiple input data vectors. The applicable vector units only receive a new input data vector at 603.

At 605, vector unit operations are performed and the results are outputted. For example, every vector unit loaded with vector operands from a corresponding row of the input data matrix slice and a corresponding column of the manipulation matrix performs a dot product operation and outputs the resulting element to an output vector as a result. The results of the vector units correspond to a row of the result matrix. The length of the resulting output row is based on the number of vector units utilized. For example, in the event 16 vector units are utilized, each output row has 16 elements. Similarly, in the event 32 vector units are utilized, each output row has 32 elements, and so forth. In various embodiments, the dot product operation performed by each vector unit is performed by utilizing a vector multiply unit and a vector adder unit of each vector unit. In some embodiments, the output vector is received at an output unit such as output unit 151 of FIG. 1 and/or output unit 251 of FIG. 2. The output unit can output the resulting row (or collection of rows accumulated over time) to memory or another appropriate location.

At 607, a determination is made whether additional input data vectors require processing. In the event an additional input data vector remains to be processed, processing loops back to 601 to process the next input data vector. In the event no additional input data vector remains to be processed, processing ends.
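The loop of steps 601 through 607 can be modeled with a small sketch in which each vector unit holds one manipulation matrix column that is reused for every input row. The class `VectorUnit`, the function `process_rows`, and the `total_units` parameter are hypothetical names for illustration; operand padding is omitted for brevity.

```python
class VectorUnit:
    """Model of one vector unit holding a reusable column operand."""

    def __init__(self, column_operand):
        self.column_operand = list(column_operand)  # loaded once, then reused

    def dot(self, row_operand):
        # Vector multiply followed by a sum of the products.
        return sum(x * w for x, w in zip(row_operand, self.column_operand))


def process_rows(data_slice, manipulation, total_units=32):
    """Broadcast each input row to the applicable vector units."""
    n_cols = len(manipulation[0])
    if n_cols > total_units:
        raise ValueError("more manipulation columns than vector units")
    # Only as many units as there are manipulation matrix columns are used.
    units = [VectorUnit([r[j] for r in manipulation]) for j in range(n_cols)]
    result = []
    for row in data_slice:                                  # 601: next input vector
        # 603: the same row is broadcast to every applicable unit.
        # 605: each unit contributes one element of the output row.
        result.append([unit.dot(row) for unit in units])
    return result                                           # 607: no rows remain
```

The length of each output row equals the number of vector units used, which in turn equals the number of manipulation matrix columns.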

FIG. 7 is a diagram illustrating an example manipulation matrix and corresponding vector operands for performing a matrix manipulation operation. In FIG. 7, up-sampling manipulation matrix 700 represents an exemplar 16×32 manipulation matrix for performing up-sampling of a 32×16 input data matrix (not shown). The result of multiplying an input data matrix by up-sampling manipulation matrix 700 is to up-sample the rows of the input data matrix by a factor of 2 by repeating every element twice. Each element along a row of a 32×16 input data matrix is duplicated. Other dimensions for a manipulation matrix for up-sampling may be appropriate as well. Up-sampling manipulation matrix 700 is configured for a matrix processor unit with at least 32 vector units, where each vector unit takes vector operands with at least 16 elements. Up-sampling manipulation matrix 700 may be utilized by a 32×32 matrix processor unit, where the 32-element vector operands are padded with 16 padding or zero-value elements. In some embodiments, the matrix processor unit is matrix processor unit 107 of FIG. 1 and/or matrix processor unit 201 of FIG. 2. In some embodiments, the processes of FIGS. 3-6 are used to apply a manipulation matrix to an input data matrix.

In the example shown, up-sampling manipulation matrix 700 is a 16×32 manipulation matrix with 16 rows and 32 columns. Each column of up-sampling manipulation matrix 700 contains a single element with a value of 1.0. All remaining elements of the column have a value of 0.0. Each pair of columns has the 1.0 value element at the same row location. As the columns progress along the row dimension, the row location of the 1.0 value element changes. Columns 701 and 703 have the 1.0 value element at row 1, columns 705 and 707 have the 1.0 value element at row 2, and so forth, with column 709 having the 1.0 value element at row 16. The ellipses shown in up-sampling manipulation matrix 700 indicate additional elements not shown to fill out the 16×32 manipulation matrix using the described pattern.

The columns of up-sampling manipulation matrix 700, such as columns 701, 703, 705, 707, and 709, among others, are each loaded as vector operands into a corresponding vector unit of the matrix processor unit. For example, column 701 is a 16-element vector that is prepared as a vector operand for a first vector unit. A dot product is determined using a row of the input data matrix and the vector operand of column 701 to determine the first element of a row result. Similarly, a dot product is determined using the same row of the input data matrix with the vector operand of column 703 to determine the second element of the row result. Using the same row of the input data matrix, dot products are determined with the vector operands of columns 705 and 707 to determine the third and fourth elements, respectively, of the row result. The remaining row elements are similarly determined using the remaining columns of up-sampling manipulation matrix 700. The last element of the row result is computed by determining the dot product using the same row of the input data matrix with the vector operand of column 709. Using up-sampling manipulation matrix 700, each row result has 32 elements.
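The pattern described for up-sampling manipulation matrix 700 can be generated and checked with a short sketch. The function names `repeat_upsampling_matrix` and `upsample_row` are hypothetical; indices in the code are zero-based, whereas the text counts rows and columns from 1.

```python
def repeat_upsampling_matrix(n_in=16, factor=2):
    """Build an n_in x (n_in * factor) matrix with a single 1.0 per column."""
    n_out = n_in * factor
    # Column j carries its 1.0 in row j // factor, so each group of `factor`
    # adjacent columns selects the same input element.
    return [
        [1.0 if j // factor == i else 0.0 for j in range(n_out)]
        for i in range(n_in)
    ]


def upsample_row(row, manipulation):
    """Dot product of one input row with every manipulation matrix column."""
    n_out = len(manipulation[0])
    return [
        sum(row[k] * manipulation[k][j] for k in range(len(row)))
        for j in range(n_out)
    ]
```

With the default arguments, the sketch produces a 16×32 matrix, and a 16-element input row becomes a 32-element row result in which each input element appears twice.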

FIG. 8 is a diagram illustrating an example input data matrix and corresponding matrix slice for performing a matrix manipulation operation. Input data matrix 800 is a two-dimensional matrix. The size of input data matrix 800 may be larger than the supported dimensions of the matrix processor unit and/or may be incompatible in its original format with the applicable manipulation matrix. Input data matrix 800 is sliced into smaller matrices compatible with both the matrix processor unit and the manipulation matrix. For example, input matrix slice 801 is a two-dimensional matrix slice of input data matrix 800 compatible with a 32×32 matrix processor unit and up-sampling manipulation matrix 700. Input matrix slice 801 is a 32×16 matrix and up-sampling manipulation matrix 700 is a 16×32 matrix. The result matrix from multiplying input matrix slice 801 by up-sampling manipulation matrix 700 is a 32×32 result matrix that converts the first 16 columns of input data matrix 800 to a 32×32 output matrix. In various embodiments, different slices of input data matrix 800 are each processed and the results combined to up-sample the entire input data matrix. The input data matrix may be sliced into different dimensions depending on the matrix processor unit and the corresponding manipulation matrix. For some manipulation matrices, the different slices may overlap. For example, manipulation matrices that involve interpolating may require the different slices to overlap by one or more columns and/or rows. In some embodiments, the process of FIG. 5 is used to slice the input data matrix into compatible input matrix slices. In some embodiments, the processes of FIGS. 3-6 are used to apply a manipulation matrix to an input data matrix. In some embodiments, the matrix processor unit is matrix processor unit 107 of FIG. 1 and/or matrix processor unit 201 of FIG. 2.

In some embodiments, each row of input data matrix slice 801 is prepared as a vector operand. In the example shown, row 803 is the first row of input data matrix slice 801. In some embodiments, each row, such as row 803, is prepared as a vector operand and broadcasted to vector units of a matrix processor unit with corresponding columns of a manipulation matrix. In various embodiments, the vector operands of the matrix processor unit support dimensions larger than a row of the input matrix slice. Padding or zero-value elements can be used to fill out the remaining elements of a vector operand. For example, a matrix processor unit operating on 32-element vector operands receives a vector operand with the 16 elements of row 803 along with 16 padding elements. Depending on the matrix processor unit and the manipulation matrix, the size of the matrix data slice row and the number of padding elements may change, as appropriate.

FIG. 9 is a diagram illustrating an example result matrix from performing a matrix manipulation operation. Result matrix 900 is a 32×32 two-dimensional matrix result determined by multiplying a 32×16 input data matrix, such as input data matrix slice 801 of FIG. 8, by a 16×32 manipulation matrix, such as up-sampling manipulation matrix 700 of FIG. 7. The elements of result matrix 900 are determined by computing the dot-product result of each row of the input data matrix with each column of the manipulation matrix. Elements 901, 903, 905, 907, and 909 are elements of the first row of result matrix 900. Elements 901, 903, 905, and 907 are elements of the first, second, third, and fourth columns, respectively, of result matrix 900. Element 909 is the element of the thirty-second and last column of result matrix 900.

In some embodiments, element 901 corresponds to the dot-product result of the first row of the input data matrix with the first column of the manipulation matrix. For example, in some embodiments, element 901 corresponds to the dot-product result of the vector operand corresponding to 16-element row 803 of input matrix slice 801 of FIG. 8 with the vector operand corresponding to 16-element column 701 of up-sampling manipulation matrix 700 of FIG. 7. Similarly, element 903 corresponds to the dot-product result of the same row from an input data matrix with the second column of a manipulation matrix, such as 16-element column 703 of up-sampling manipulation matrix 700 of FIG. 7. Elements 901 and 903 have the same value (X_(1,1)) as the element at the first row and first column of the input matrix slice. Elements 905 and 907 have the same value (X_(1,2)) as the element at the first row and second column of the input matrix slice. Their values correspond to the dot-product result of the same row from an input data matrix with the third and fourth columns, respectively, of a manipulation matrix, such as 16-element columns 705 and 707, respectively, of up-sampling manipulation matrix 700 of FIG. 7. For each row of result matrix 900, the elements of a corresponding row from the input data matrix are repeated twice to up-sample rows of the input data matrix by a factor of two. As an additional example, the last element of the first row of result matrix 900, element 909, has the same value (X_(1,16)) as the element at the first row and sixteenth column of the input matrix slice. The value corresponds to the dot-product result of the same row from an input data matrix with the thirty-second and last column of a manipulation matrix, such as 16-element column 709 of up-sampling manipulation matrix 700 of FIG. 7. In some embodiments, the processes of FIGS. 3-6 are used to apply a manipulation matrix to an input data matrix to determine result matrix 900. In some embodiments, the matrix processor unit used to perform the matrix manipulation operation is matrix processor unit 107 of FIG. 1 and/or matrix processor unit 201 of FIG. 2.

FIG. 10 is a diagram illustrating an example manipulation matrix and corresponding vector operands for performing a matrix manipulation operation. In FIG. 10, up-sampling manipulation matrix 1000 represents an exemplar 16×31 manipulation matrix for performing up-sampling of a 32×16 input data matrix (not shown). The result of multiplying an input data matrix by up-sampling manipulation matrix 1000 is to up-sample the rows of the input data matrix by nearly a factor of 2 by linear interpolation. A 32×16 input data matrix is up-sampled to a 32×31 result matrix. Between every pair of adjacent columns of the original 16 columns of the input data matrix, an additional column is inserted by averaging the two neighboring columns. Other dimensions for a manipulation matrix for up-sampling may be appropriate as well. Up-sampling manipulation matrix 1000 is configured for a matrix processor unit with at least 31 vector units, where each vector unit takes vector operands with at least 16 elements. Up-sampling manipulation matrix 1000 may be utilized by a 32×32 matrix processor unit, where the 32-element vector operands are padded with 16 padding or zero-value elements. In some embodiments, the matrix processor unit is matrix processor unit 107 of FIG. 1 and/or matrix processor unit 201 of FIG. 2. In some embodiments, the processes of FIGS. 3-6 are used to apply a manipulation matrix to an input data matrix.

In the example shown, up-sampling manipulation matrix 1000 is a 16×31 manipulation matrix with 16 rows and 31 columns. Each column of up-sampling manipulation matrix 1000 contains either a single element with a value of 1.0 or a pair of elements each with a value of 0.5. All remaining elements of the column have a value of 0.0. Column 1001, the first column of up-sampling manipulation matrix 1000, is (1, 0, 0, . . . , 0) and results in the first column of the result matrix being equal to the first column of the input data matrix. Column 1003, the second column of up-sampling manipulation matrix 1000, is (0.5, 0.5, 0, . . . , 0) and results in the second column of the result matrix being equal to the average of the first and second columns of the input data matrix. Column 1005, the third column of up-sampling manipulation matrix 1000, is (0, 1, 0, . . . , 0) and results in the third column of the result matrix being equal to the second column of the input data matrix. Column 1007, the fourth column of up-sampling manipulation matrix 1000, is (0, 0.5, 0.5, . . . , 0) and results in the fourth column of the result matrix being equal to the average of the second and third columns of the input data matrix. This pattern continues until last column 1009. Column 1009, the last and thirty-first column of up-sampling manipulation matrix 1000, is (0, 0, 0, . . . , 1) and results in the last and thirty-first column of the result matrix being equal to the sixteenth column of the input data matrix. The ellipses shown in up-sampling manipulation matrix 1000 indicate additional elements not shown to fill out the 16×31 manipulation matrix using the described pattern.

The columns of up-sampling manipulation matrix 1000, such as columns 1001, 1003, 1005, 1007, and 1009, among others, are each loaded as vector operands into a corresponding vector unit of the matrix processor unit. For example, column 1001 is a 16-element vector that is prepared as a vector operand for a first vector unit. A dot product is determined using a row of the input data matrix and the vector operand of column 1001 to determine the first element of a row result. Similarly, a dot product is determined using the same row of the input data matrix with the vector operand of column 1003 to determine the second element of the row result. Using the same row of the input data matrix, dot products are determined with the vector operands of columns 1005 and 1007 to determine the third and fourth elements, respectively, of the row result. The remaining row elements are similarly determined using the remaining columns of up-sampling manipulation matrix 1000. The last element of the row result is computed by determining the dot product using the same row of the input data matrix with the vector operand of column 1009. Using up-sampling manipulation matrix 1000, each row result has 31 elements.
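The per-vector-unit computation described above may be modeled in software as one dot product per column operand. The sketch below is illustrative only: the names dot and row_result are hypothetical, the sample data row is invented for the example, and the sequential loop stands in for vector units that would operate in parallel in hardware. It also checks that padding the 16-element operands with 16 zero-value elements, as may be done for a 32×32 matrix processor unit, leaves the row result unchanged.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def row_result(data_row, operand_columns):
    """Model one vector unit per column operand; each yields one element."""
    return [dot(data_row, col) for col in operand_columns]


# Column operands of a 16x31 up-sampling manipulation matrix.
n = 16
operands = []
for j in range(2 * n - 1):
    col = [0.0] * n
    if j % 2 == 0:
        col[j // 2] = 1.0          # copy column
    else:
        col[j // 2] = 0.5          # averaging column
        col[j // 2 + 1] = 0.5
    operands.append(col)

# Assumed sample row of the input data matrix: 0, 2, 4, ..., 30.
data_row = [float(2 * k) for k in range(n)]
result = row_result(data_row, operands)
print(len(result))    # 31
print(result[:5])     # [0.0, 1.0, 2.0, 3.0, 4.0]

# Zero-padding both operands to 32 elements does not change the result.
padded_row = data_row + [0.0] * 16
padded_operands = [col + [0.0] * 16 for col in operands]
print(row_result(padded_row, padded_operands) == result)   # True
```

With this sample row, the interpolated elements fall exactly halfway between their neighbors, so the 31-element row result is the sequence 0.0 through 30.0.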

FIG. 11 is a diagram illustrating an example manipulation matrix and corresponding vector operands for performing a matrix manipulation operation. In FIG. 11, down-sampling manipulation matrix 1100 represents an exemplar 32×16 manipulation matrix for performing down-sampling of a 32×32 input data matrix (not shown). The result of multiplying an input data matrix by down-sampling manipulation matrix 1100 is to down-sample by pooling (or averaging) pairs of adjacent elements in each row. A 32×32 input data matrix is down-sampled to a 32×16 result matrix. Each of the 16 columns in the result matrix is determined by averaging two columns of the input data matrix. Other dimensions for a manipulation matrix for down-sampling may be appropriate as well. Down-sampling manipulation matrix 1100 is configured for a matrix processor unit with at least 16 vector units, where each vector unit takes vector operands with at least 32 elements. Down-sampling manipulation matrix 1100 may be utilized by a 32×32 matrix processor unit, where only 16 of the total 32 vector units are used. In some embodiments, the matrix processor unit is matrix processor unit 107 of FIG. 1 and/or matrix processor unit 201 of FIG. 2. In some embodiments, the processes of FIGS. 3-6 are used to apply a manipulation matrix to an input data matrix.

In the example shown, down-sampling manipulation matrix 1100 is a 32×16 manipulation matrix with 32 rows and 16 columns. Each column of down-sampling manipulation matrix 1100 contains a pair of elements each with a value of 0.5. All remaining elements of the column have a value of 0.0. Column 1101, the first column of down-sampling manipulation matrix 1100, is (0.5, 0.5, 0, 0, . . . , 0) and results in the first column of the result matrix being equal to the average of the first and second columns of the input data matrix. Column 1103, the second column of down-sampling manipulation matrix 1100, is (0, 0, 0.5, 0.5, 0, . . . , 0) and results in the second column of the result matrix being equal to the average of the third and fourth columns of the input data matrix. This matrix element pattern continues until last column 1105. Column 1105, the last and sixteenth column of down-sampling manipulation matrix 1100, is (0, 0, . . . , 0, 0, 0.5, 0.5) and results in the last and sixteenth column of the result matrix being equal to the average of the thirty-first and thirty-second columns of the input data matrix. The ellipses shown in down-sampling manipulation matrix 1100 indicate additional elements not shown to fill out the 32×16 manipulation matrix using the described pattern.
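The down-sampling pattern described above can likewise be generated programmatically. The following is a minimal sketch in plain Python under the same caveats as any software model of the hardware; the function name downsampling_matrix is hypothetical. Column j (counting from zero) holds its two 0.5 entries in rows 2j and 2j+1, so the last column has its non-zero entries in the final two rows, consistent with the (0, 0, . . . , 0, 0, 0.5, 0.5) pattern of column 1105.

```python
def downsampling_matrix(n_out):
    """Return a (2 * n_out) x n_out pairwise-averaging matrix as a list of rows.

    Hypothetical helper: the name and the list-of-rows layout are
    assumptions made for this sketch.
    """
    rows = 2 * n_out
    m = [[0.0] * n_out for _ in range(rows)]
    for j in range(n_out):
        m[2 * j][j] = 0.5        # first element of the averaged pair
        m[2 * j + 1][j] = 0.5    # second element of the averaged pair
    return m


down = downsampling_matrix(16)
first_column = [row[0] for row in down]
last_column = [row[15] for row in down]
print(len(down), len(down[0]))   # 32 16
print(first_column[:4])          # [0.5, 0.5, 0.0, 0.0]
print(last_column[-4:])          # [0.0, 0.0, 0.5, 0.5]
```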

The columns of down-sampling manipulation matrix 1100, such as columns 1101, 1103, and 1105, among others, are each loaded as vector operands into a corresponding vector unit of the matrix processor unit. For example, column 1101 is a 32-element vector that is prepared as a vector operand for a first vector unit. A dot product is determined using a row of the input data matrix and the vector operand of column 1101 to determine the first element of a row result. Similarly, a dot product is determined using the same row of the input data matrix with the vector operand of column 1103 to determine the second element of the row result. The remaining row elements are similarly determined using the remaining columns of down-sampling manipulation matrix 1100. The last element of the row result is computed by determining the dot product using the same row of the input data matrix with the vector operand of column 1105. Using down-sampling manipulation matrix 1100, each row result has 16 elements.
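Applying the row-by-row procedure above to every row of a 32×32 input data matrix yields the full 32×16 result matrix. The sketch below is an illustrative software model only; the name apply_manipulation and the sample input values are assumptions made for the example. Each of the 16 column operands plays the role of one vector unit, so only 16 dot products are computed per row.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def apply_manipulation(data, manip):
    """Multiply data (list of rows) by manip (list of rows), row by row."""
    operands = list(zip(*manip))   # columns of manip, one per vector unit
    return [[dot(row, col) for col in operands] for row in data]


# 32x16 down-sampling manipulation matrix: rows 2j and 2j+1 feed column j.
down = [[0.5 if i // 2 == j else 0.0 for j in range(16)] for i in range(32)]

# Assumed sample 32x32 input data matrix with element (r, c) = 32 * r + c.
data = [[float(32 * r + c) for c in range(32)] for r in range(32)]

result = apply_manipulation(data, down)
print(len(result), len(result[0]))   # 32 16
print(result[0][:3])                 # [0.5, 2.5, 4.5]
print(result[31][15])                # 1022.5
```

The final printed value is the average of the last two elements of the last input row (1022 and 1023), which corresponds to the last column operand.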

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

What is claimed is:
 1. A system, comprising: a memory; and a processing element in communication with the memory, wherein the processing element is configured to: receive a processor instruction specifying a data matrix and a matrix manipulation operation, wherein the processing element is configured with a capability to selectively perform a plurality of different types of matrix manipulation operation options, the plurality of different types of matrix manipulation operation options including a matrix dimension up-sampling type operation option and a matrix dimension down-sampling type operation option, and the matrix manipulation operation specified by the received processor instruction is one of the plurality of different types of matrix manipulation operation options; automatically select from a repository of predefined manipulation matrices, a manipulation matrix based on the matrix manipulation operation specified in the processor instruction; and perform a matrix operation using the data matrix and the selected manipulation matrix to determine a result matrix.
 2. The system of claim 1, wherein the data matrix is retrieved from the memory.
 3. The system of claim 1, wherein the selected manipulation matrix is retrieved from a local memory of the processing element.
 4. The system of claim 1, wherein the matrix operation performed is a matrix multiplication operation.
 5. The system of claim 1, wherein the result matrix is stored on a storage on the processing element.
 6. The system of claim 1, wherein the processing element is one of a plurality of processing elements configured to operate in parallel.
 7. The system of claim 1, wherein the result matrix is outputted using an output unit included in the system.
 8. The system of claim 7, wherein the output unit is configured to perform multiple duplicative writes to output an up-sampled result matrix.
 9. The system of claim 1, wherein the selected manipulation matrix is an up-sampling matrix.
 10. The system of claim 9, wherein the up-sampling matrix is configured to perform a linear interpolation between row elements.
 11. The system of claim 1, wherein the selected manipulation matrix is a down-sampling matrix.
 12. The system of claim 1, wherein the processing element includes: a first type of register configured to store values of a single row of the data matrix; a group of a second type of registers, wherein each of the second type of registers is configured to store values of a different column of the selected manipulation matrix; and a plurality of vector calculation units, wherein each of the plurality of vector calculation units corresponds to one of the second type of registers, and each of the vector calculation units is configured to multiply each value stored in the first type of register with a corresponding value stored in the corresponding one of the second type of registers and sum together multiplication results of the corresponding vector calculation unit to at least in part determine a corresponding element in the result matrix of multiplying the data matrix with the selected manipulation matrix.
 13. The system of claim 12, wherein the first type of register is configured to broadcast contents to each of the plurality of vector calculation units.
 14. The system of claim 12, wherein each of the plurality of vector calculation units includes a vector multiply unit and a vector adder unit.
 15. A method, comprising: receiving at a processing element a processor instruction specifying a data matrix and a matrix manipulation operation, wherein the processing element is configured with a capability to selectively perform a plurality of different types of matrix manipulation operation options, the plurality of different types of matrix manipulation operation options including a matrix dimension up-sampling type operation option and a matrix dimension down-sampling type operation option, and the matrix manipulation operation specified by the received processor instruction is one of the plurality of different types of matrix manipulation operation options; automatically selecting from a repository of predefined manipulation matrices, a manipulation matrix based on the matrix manipulation operation specified in the processor instruction; and performing a matrix operation using the data matrix and the selected manipulation matrix to determine a result matrix.
 16. The method of claim 15, wherein performing the matrix operation includes: loading each one of a column of the selected manipulation matrix into one of a plurality of vector calculation units; and broadcasting a row of the data matrix to each of the plurality of vector calculation units.
 17. The method of claim 16, wherein performing the matrix operation includes: for each of the plurality of vector calculation units: multiplying elements of the broadcasted row of the data matrix with corresponding elements of the corresponding loaded column of the selected manipulation matrix to determine multiplication results; and summing together the multiplication results of the corresponding vector calculation unit to determine a corresponding element of a corresponding row of the result matrix of multiplying the data matrix with the selected manipulation matrix.
 18. A system, comprising: a memory; and a plurality of processing elements configured to operate in parallel, wherein at least one processing element of the plurality of processing elements is configured to: receive a processor instruction specifying a data matrix and a matrix manipulation operation, wherein the processing element is configured with a capability to selectively perform a plurality of different types of matrix manipulation operation options, the plurality of different types of matrix manipulation operation options including a matrix dimension up-sampling type operation option and a matrix dimension down-sampling type operation option, and the matrix manipulation operation specified by the received processor instruction is one of the plurality of different types of matrix manipulation operation options; automatically select from a repository of predefined manipulation matrices, a manipulation matrix based on the matrix manipulation operation specified in the processor instruction; and perform a matrix operation using the data matrix and the selected manipulation matrix to determine a result matrix.
 19. The system of claim 18, wherein the selected manipulation matrix is an up-sampling matrix.
 20. The system of claim 18, wherein the at least one processing element includes: a first type of register configured to store values of a single row of the data matrix; a group of a second type of registers, wherein each of the second type of registers is configured to store values of a different column of the selected manipulation matrix; and a plurality of vector calculation units, wherein each of the plurality of vector calculation units corresponds to one of the second type of registers, and each of the vector calculation units is configured to multiply each value stored in the first type of register with a corresponding value stored in the corresponding one of the second type of registers and sum together multiplication results of the corresponding vector calculation unit to at least in part determine a corresponding element in the result matrix of multiplying the data matrix with the selected manipulation matrix.